Analyzing Income Level in the Adult Dataset#

Author: Sihong Yao

Course Project, UC Irvine, Math 10, Summer 2023

Introduction#

The provided code is part of a project that focuses on analyzing a dataset related to income levels. The project involves data preprocessing, exploratory data analysis through visualizations, and applying machine learning algorithms such as decision trees, K-nearest neighbors, and logistic regression to predict income levels based on various features.

Data preprocessing#

  • read in dataset

import pandas as pd
import altair as alt
import numpy as np

# train data
with open('adult.data', 'r') as f:
    lines = f.readlines()

columns = 'age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, label'.split(', ')
train = [line.strip().split(', ') for line in lines]
train = pd.DataFrame(train, columns=columns)

# test data
with open('adult.test', 'r') as f:
    lines = f.readlines()

columns = 'age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, label'.split(', ')
test = [line.strip().split(', ') for line in lines[1:]]
test = pd.DataFrame(test, columns=columns)
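As an aside, the same files can be loaded with pandas directly instead of manual line splitting. The sketch below parses an inline two-row sample in the adult.data format so it runs anywhere; in the notebook, the filename would be passed in place of the buffer:

```python
import io
import pandas as pd

columns = ('age, workclass, fnlwgt, education, education-num, marital-status, '
           'occupation, relationship, race, sex, capital-gain, capital-loss, '
           'hours-per-week, native-country, label').split(', ')

# two rows in the adult.data format, inlined so this sketch is self-contained;
# in the notebook, pass 'adult.data' instead of the buffer
buffer = io.StringIO(
    '39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, '
    'Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n'
    '50, ?, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, '
    'Husband, White, Male, 0, 0, 13, United-States, <=50K\n'
)

# sep=', ' (comma plus space) needs the python engine; na_values='?' maps the
# dataset's missing-value placeholder to NaN at load time
df_sample = pd.read_csv(buffer, sep=', ', engine='python',
                        header=None, names=columns, na_values='?')
```

For adult.test, `skiprows=1` would additionally drop the leading comment line, mirroring `lines[1:]` above.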
  • overall look

train.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country label
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
test.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country label
0 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
1 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
2 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
3 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
4 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States <=50K
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32562 entries, 0 to 32561
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32562 non-null  object
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  object
 3   education       32561 non-null  object
 4   education-num   32561 non-null  object
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  object
 11  capital-loss    32561 non-null  object
 12  hours-per-week  32561 non-null  object
 13  native-country  32561 non-null  object
 14  label           32561 non-null  object
dtypes: object(15)
memory usage: 3.7+ MB
  • missing values

train.isnull().mean()[train.isnull().mean() > 0]
workclass         0.000031
fnlwgt            0.000031
education         0.000031
education-num     0.000031
marital-status    0.000031
occupation        0.000031
relationship      0.000031
race              0.000031
sex               0.000031
capital-gain      0.000031
capital-loss      0.000031
hours-per-week    0.000031
native-country    0.000031
label             0.000031
dtype: float64
test.isnull().mean()[test.isnull().mean() > 0]
workclass         0.000031
fnlwgt            0.000031
education         0.000031
education-num     0.000031
marital-status    0.000031
occupation        0.000031
relationship      0.000031
race              0.000031
sex               0.000031
capital-gain      0.000031
capital-loss      0.000031
hours-per-week    0.000031
native-country    0.000031
label             0.000031
dtype: float64
# drop those rows with missing values
train = train.dropna()
test = test.dropna()
  • change data type

# columns that should be integers
num_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain',
            'capital-loss', 'hours-per-week']
train[num_cols] = train[num_cols].astype(int)
test[num_cols] = test[num_cols].astype(int)

# labels in the test file carry a trailing period ('<=50K.'), so strip it
test['label'] = test['label'].map(lambda x: x.replace('.', ''))
  • merge train and test together

df = pd.concat([train, test], axis=0)
df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country label
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

Data visualization#

Bar Plot: A bar plot can be used to visualize the distribution of the target variable and other categorical variables. We can use it to compare the frequency of different categories.

# Sample a subset of the dataset
sample_df = df.sample(n=5000, random_state=42)

# Bar plot for target variable
bar_plot = alt.Chart(sample_df).mark_bar().encode(
    x='label:N',
    y='count():Q'
).properties(
    title='Distribution of Target Variable'
)
bar_plot
df['label'].value_counts()
<=50K    49439
>50K     15682
Name: label, dtype: int64

In this case, there are two unique values in the ‘label’ column: “<=50K” and “>50K”. In the combined dataset, 49,439 rows are labeled “<=50K”, meaning the individual earns at most 50,000, and 15,682 rows are labeled “>50K”, meaning the individual earns more (the incomes are in US dollars, as the data come from the US Census).

The dataset is therefore imbalanced: roughly three quarters of individuals earn at most 50K, which is worth keeping in mind when judging model accuracy later.
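The imbalance can be expressed as proportions rather than raw counts with value_counts(normalize=True). A minimal sketch on stand-in labels (in the notebook, df['label'] would be used):

```python
import pandas as pd

# stand-in labels: three of four rows in the majority class
labels = pd.Series(['<=50K', '<=50K', '<=50K', '>50K'])

# normalize=True returns fractions instead of raw counts
proportions = labels.value_counts(normalize=True)
```

On the full df, the same call gives roughly 0.76 for “<=50K” (49,439 of 65,121 rows).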

Histogram: A histogram can help explore the distributions of continuous variables such as age, education-num, capital-gain, capital-loss, and hours-per-week.

# Histogram for age
histogram = alt.Chart(sample_df).mark_bar().encode(
    alt.X('age:Q', bin=True),
    y='count():Q'
).properties(
    title='Distribution of Age'
)
histogram

The histogram shows the distribution of ages in the dataset. It indicates that the highest frequency of individuals falls within the age range of approximately 31.6 to 38.9 years. The frequency remains relatively high in the surrounding age ranges as well, ranging from approximately 24.3 to 46.2 years.

Box Plot: A box plot can be used to visualize the distribution of continuous variables across different categories. For example, we can compare the distribution of age between different workclasses.

# Box plot for age across workclass
box_plot = alt.Chart(sample_df).mark_boxplot().encode(
    x='workclass:N',
    y='age:Q',
).properties(
    title='Distribution of Age across Workclass'
)
box_plot

Scatter Plot: A scatter plot can help visualize the relationship between two continuous variables. For instance, we can examine the relationship between age and capital-gain.

# Scatter plot for age and capital-gain
scatter_plot = alt.Chart(sample_df).mark_circle().encode(
    x='age:Q',
    y='capital-gain:Q',
    color='label:N'
).properties(
    title='Relationship between Age and Capital Gain'
)
scatter_plot

The scatter plot suggests a joint relationship among age, capital gain, and the income label: large capital gains are concentrated in the middle of the age range and are mostly associated with the “>50K” label.
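The suggested relationship can also be checked numerically by comparing the mean age per income group. A small sketch on toy data (in the notebook, sample_df or df would be grouped instead):

```python
import pandas as pd

# toy stand-in for the sampled frame
toy = pd.DataFrame({'age': [25, 30, 45, 50],
                    'label': ['<=50K', '<=50K', '>50K', '>50K']})

# mean age per income group makes the age/label relationship concrete
mean_age = toy.groupby('label')['age'].mean()
```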

Grouped Bar Plot: A grouped bar plot can be used to compare the frequencies of a categorical variable across different groups. For example, we can compare the education levels among different income groups.

# Grouped bar plot for education and income groups
grouped_bar_plot = alt.Chart(sample_df).mark_bar().encode(
    x='education:N',
    y='count():Q',
    color='label:N',
    column='label:N'
).properties(
    title='Distribution of Education Level across Income Groups'
)
grouped_bar_plot

Higher education levels appear to be associated with a larger share of “>50K” incomes, though the plot shows association rather than causation.
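One way to quantify this impression is a row-normalized crosstab of education against label. A sketch on toy data (in the notebook, df['education'] and df['label'] would be passed):

```python
import pandas as pd

# toy stand-in where graduate degrees skew toward '>50K'
toy = pd.DataFrame({
    'education': ['HS-grad', 'HS-grad', 'Masters', 'Masters'],
    'label':     ['<=50K',   '<=50K',   '>50K',    '<=50K'],
})

# normalize='index' converts each education row into label shares,
# so education levels are comparable regardless of their sizes
share = pd.crosstab(toy['education'], toy['label'], normalize='index')
```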

Machine learning#

Transform all data into numeric ones#

# Iterate over columns of object datatype in the DataFrame
for col in df.select_dtypes(include='object').columns:
    # Create a mapping dictionary with unique values as keys and corresponding numeric labels as values
    value_mapping = dict(zip(df[col].unique(), range(df[col].nunique())))
    
    # Map the values in the column to their corresponding numeric labels using the mapping dictionary
    df[col] = df[col].map(value_mapping)

# re-encode so that the first unique value (originally '<=50K') becomes the positive class 1
df['label'] = np.where(df['label'] == df['label'].unique()[0], 1, 0)
train = df.head(len(train))
test = df.tail(len(test))
X_train, X_test, y_train, y_test = train.drop('label', axis=1), test.drop('label', axis=1), train['label'], test['label']

The purpose of this code is to encode categorical variables with numeric labels, since the scikit-learn estimators used below require numeric inputs. By mapping each unique value to an integer, the categorical variables become a format the algorithms can process. Note that after the re-encoding of ‘label’, the positive class (label 1) corresponds to “<=50K”.
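pandas offers the same first-appearance encoding in a single call via pd.factorize; a minimal sketch on hypothetical workclass values:

```python
import pandas as pd

# hypothetical workclass values
col = pd.Series(['State-gov', 'Private', 'Private', 'Self-emp-not-inc'])

# codes are assigned in order of first appearance, matching the
# dict(zip(df[col].unique(), range(...))) mapping used above
codes, uniques = pd.factorize(col)
```

A caveat for both approaches: integer codes impose an artificial ordering on nominal categories, which tree models tolerate but linear models such as logistic regression can be sensitive to; one-hot encoding is often preferable there.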

Decision Trees#

from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

clf = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=20)
clf.fit(X_train, y_train)
fig = plt.figure(figsize=(200, 100))  # large canvas so the tree labels stay legible
plot_tree(
    clf,
    feature_names=X_train.columns,
    filled=True
);
clf.score(X_train, y_train)
0.8449064832161174
clf.score(X_test, y_test)
0.8449017199017199

The training and test accuracies are nearly identical, so the model shows little overfitting or underfitting and interpreting its structure is meaningful.

The tree starts with the entire dataset, which consists of 32,561 samples. At the root of the tree, the first split is made based on the “capital-gain” feature. If an individual’s capital gain is less than or equal to 5119.0, they follow the left branch; otherwise, they follow the right branch.

The left branch represents individuals with a capital gain less than or equal to 5119.0. Within this group, the tree further splits based on the “marital-status” feature. If an individual’s marital status is less than or equal to 0.5, they follow the left branch; otherwise, they follow the right branch.

The right branch represents individuals with a capital gain greater than 5119.0. Within this group, the tree splits based on the “capital-gain” feature again. If an individual’s capital gain is less than or equal to 7073.5, they follow the left branch; otherwise, they follow the right branch.
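The splits described above can also be printed as indented text rules with scikit-learn's export_text. The sketch below fits a tiny tree on synthetic data so it is self-contained; in the notebook, clf and X_train.columns would be passed instead:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# tiny synthetic stand-in: the label is decided by a capital-gain threshold
X = [[0], [1000], [6000], [9000]]
y = [0, 0, 1, 1]
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# each split is rendered as an indented rule, one line per node
rules = export_text(tree, feature_names=['capital-gain'])
print(rules)
```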

KNN#

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=2)
knn.score(X_train, y_train)
0.8613678941064463
knn.score(X_test, y_test)
0.8613636363636363

A KNN classifier with only two neighbors tends to memorize its training data and overfit, so this model was not tuned further.
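A fuller check would sweep n_neighbors and compare train and test accuracy: a large gap signals overfitting, while two low scores signal underfitting. A sketch on synthetic data (in the notebook, the encoded X_train/X_test splits would be used):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the encoded Adult features
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for k in [1, 5, 15, 45]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # (train accuracy, test accuracy); a large gap signals overfitting
    scores[k] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))
```

With k=1 the classifier memorizes the training set exactly, which is the extreme case of the overfitting discussed above.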

Logistic regression#

from sklearn.linear_model import LogisticRegression

lg = LogisticRegression()
lg.fit(X_train, y_train)
LogisticRegression()
lg.score(X_train, y_train)
0.7957679432449863
lg.score(X_test, y_test)
0.7957923832923833

The training and test scores are nearly identical, indicating little overfitting or underfitting, so interpreting the model's results is meaningful.
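A single train/test comparison can be complemented by k-fold cross-validation, which scores the model on several held-out folds. A sketch on synthetic data (in the notebook, X_train and y_train would be passed):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the encoded training data
X, y = make_classification(n_samples=300, random_state=0)

# five held-out folds; similar scores across folds suggest stable generalization
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```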

coefficients = lg.coef_[0]
intercept = lg.intercept_[0]

for feature_name, coef in zip(X_train.columns, coefficients):
    print(f"{feature_name}: {coef}")

print(f"Intercept: {intercept}")
age: 2.765496553248842e-05
workclass: 2.9604502063673026e-06
fnlwgt: 6.499810534462901e-06
education: 6.9485024111467315e-06
education-num: 6.3174650351243135e-06
marital-status: 1.7267699333154404e-06
occupation: 1.0679153784229773e-05
relationship: 4.937479617114498e-06
race: 8.244339835923701e-07
sex: 1.6757601566899827e-06
capital-gain: -0.0003211692305544448
capital-loss: -0.0007008467407639085
hours-per-week: 3.095780743855136e-05
native-country: 2.9178749614852293e-06
Intercept: 1.5086401875809778e-06

Here’s a simplified interpretation of the coefficients in plain words:

Age: The older a person is, the slightly more likely they are to belong to the positive class.

Workclass: The type of work a person does has a very small impact on whether they belong to the positive class or not.

Fnlwgt: A measure called fnlwgt doesn’t strongly influence whether a person belongs to the positive class or not.

Education: The level of education a person has only has a minor effect on whether they belong to the positive class or not.

Education-num: An alternative representation of education level doesn’t have a strong impact on whether a person belongs to the positive class or not.

Marital-status: Whether a person is married or not doesn’t have a significant influence on belonging to the positive class.

Occupation: The type of occupation a person has has a small influence on whether they belong to the positive class or not.

Relationship: The nature of a person’s relationship doesn’t strongly determine whether they belong to the positive class or not.

Race: A person’s race has a very small effect on whether they belong to the positive class or not.

Sex: Gender has a minimal impact on whether a person belongs to the positive class or not.

Capital-gain: Higher capital gains are associated with a lower likelihood of belonging to the positive class.

Capital-loss: Higher capital losses are associated with a lower likelihood of belonging to the positive class.

Hours-per-week: Working more hours per week slightly increases the chances of belonging to the positive class.

Native-country: The country of origin has a minimal impact on whether a person belongs to the positive class or not.

Intercept: The intercept is the base log-odds of the positive class when all features are zero; here it is very close to zero.
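To rank the features, the coefficients can be placed in a pandas Series and sorted by absolute value. The sketch below uses a subset of the values printed above; note that because the features were not standardized, raw coefficient magnitudes are not strictly comparable across features:

```python
import pandas as pd

# a subset of the coefficients reported by the fitted model above
coefs = pd.Series({
    'age':            2.765496553248842e-05,
    'hours-per-week': 3.095780743855136e-05,
    'capital-gain':  -0.0003211692305544448,
    'capital-loss':  -0.0007008467407639085,
})

# rank features by coefficient magnitude
ranked = coefs.abs().sort_values(ascending=False)
print(ranked.index[0])  # capital-loss has the largest magnitude of these four
```

Standardizing the features (e.g. with StandardScaler) before fitting would make such magnitude comparisons meaningful.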

Summary#

The machine learning models implemented in the project achieved moderate to good performance in predicting income levels. K-nearest neighbors attained the highest accuracy on both the training and test datasets, followed by decision trees and then logistic regression. The visualizations provided valuable insights into the dataset, highlighting the class imbalance in income levels and the relationships between different variables.

References#


  • What is the source of your dataset(s)?

The dataset is the Adult (Census Income) dataset from the UCI Machine Learning Repository, extracted from the 1994 U.S. Census database; the files adult.data and adult.test are the standard train/test split distributed there.
